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Abstract 

The Drosophila Y chromosome is a degenerated, heterochromatic chromosome with few functional genes. Despite this, natural 
variation on the Y chromosome in D. melanogaster has substantial frans-acting effects on the regulation of X-linked and autosomal 
genes. It is not clear, however, whether these genes simply represent a random subset of the genome or whether specific functional 
properties are associated with susceptibility to regulation by Y-linked variation. Here, we present a meta-analysis of four previously 
published microarray studies of Y-linked regulatory variation (YRV) in D. melanogaster. We show that YRV genes are far from a 
random subset of the genome: They are more likely to be in repressive chromatin contexts, be expressed tissue specifically, and vary in 
expression within and between species than non-YRV genes. Furthermore, YRV genes are more likely to be associated with the 
nuclear lamina than non-YRV genes and are generally more likely to be close to each other in the nucleus (although not along 
chromosomes). Taken together, these results suggest that variation on the Y chromosome plays a role in modifying how the genome 
is distributed across chromatin compartments, either via changes in the distribution of DNA-binding proteins or via changes in the 
spatial arrangement of the genome in the nucleus. 
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Introduction 

The Drosophila Y chromosome, despite comprising approxi- 
mately 20% of the male genome in Drosophila melanogaster, 
contains fewer than 20 genes, primarily with specialized male 
reproductive functions (Gatti and Pimpinelli 1983; Bonaccorsi 
and Lohe 1991 ; Carvalho et al. 2000, 2001, 2003; Koerich et 
al. 2008; Vibranovski et al. 2008; Krsticevic et al. 2010). Most 
of the chromosome consists of megabase-sized blocks of 
repetitive DNA, including sequences derived from transpos- 
able elements, large microsatellite blocks, and the Y-linked 
ribosomal DNA (rDNA) array, bobbed (Gatti and Pimpinelli 
1983; Bonaccorsi and Lohe 1991; Lohe et al. 1993). 
Although it has long been known that the Y chromosome is 
essential for male fertility (Brosseau 1960), until recently the 
Drosophila Y chromosome was not thought to have any other 
significant functions or harbor significant variation among 
or across populations. In the past decade, however, evidence 
has emerged that genetic variation on the Y chromosome 
is associated with variation in a number of traits, including 



overall male fitness (Chippindale and Rice 2001), sensitivity 
of spermatogenesis to thermal stress (Rohmer et al. 2004; 
David et al. 2005), geotaxis (Stoltenberg and Hirsch 1 997), 
and position effect variegation (PEV; Lemos et al. 2010). 
Despite these observations, a mechanistic basis for widespread 
phenotypic effects of Y-linked variation has remained elusive. 

One potential mechanism is the phenomenon of Y-linked 
regulatory variation (YRV). Variation on the Y chromosome in 
both D. melanogaster and D. simulans is associated with var- 
iation in expression of autosomal and X-linked genes (Lemos 
et al. 2008; 2010; Jiang et al. 2010; Sackton et al. 201 1). By 
introgressing Y chromosomes from a variety of D. melanoga- 
ster stocks into a common laboratory background, Lemos 
et al. (2008) demonstrated that hundreds of genes vary in 
expression in males across lines that differ only in the popula- 
tion of origin of their Y chromosome, whereas no genes vary 
in expression across females of the same lines. Subsequently, it 
has been shown that expression of Y-linked protein-coding 
genes plays at most a minor role in YRV: Because sex 
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determination in Drosophila is based on the number of 
X chromosomes, individuals with an XX/Y genotype are 
female, and can be constructed using standard D. melanoga- 
ster stocks. These females do not transcribe genes from their 

Y chromosome but still show frans-acting effects of Y-linked 
variation on gene expression (Lemos et al. 2010). YRV is also 
subject to significant Y-by-background epistatic effects (Jiang 
et al. 2010) and at least partially attributable to variation in 
rDNA content on the Y, as YRV is observed among mutant 

Y chromosomes that vary only in rDNA content (Paredes et al. 

2011) . Furthermore, Y chromosome divergence between 
D. simulans and D. sechellia is also associated with gene 
expression changes (Sackton et al. 201 1). 

The phenomenon of YRV implies that the Y chromosome 
interacts with the rest of the genome in previously unantici- 
pated ways to modify gene expression patterns. However, 
theoretical predictions and empirical studies suggest that 
genetic variation on the Y chromosome should be low relative 
to the autosomes and the X chromosome (Clark 1987, 1990; 
Clark and Lyckegaard 1990; Bachtrog and Charlesworth 
2002; Bachtrog 2005, 2006; Kaiser and Charlesworth 
2010), and recent sequencing results in D. melanogaster 
and other species have confirmed this expectation 
(Zurovcova and Eanes 1999; Kopp, Frank, and Barmina 
2006; Kopp, Frank, and Fu 2006; Larracuente and Clark 

2012) . Some evidence hints that this result may be limited 
to single-nucleotide polymorphisms, however, and that struc- 
tural variation may be more prevalent. Multiple cytologically 
distinguishable forms of the Y chromosome segregate in 
at least some species of Drosophila (Dobzhansky 1935), and 
variation in rDNA array size and other repetitive sequence 
blocks exists (Lyckegaard and Clark 1989; Clark and 
Lyckegaard 1990). Structural variation in the size of the 

Y chromosome, not single-nucleotide polymorphism, is asso- 
ciated with variation in PEV in strains of D. melanogaster with 
varying amounts of the Y chromosome fused to the X chro- 
mosome (Dimitri and Pisano 1989). A reasonable hypothesis, 
therefore, is that variation across Y chromosomes in the type, 
amount, and distribution of repetitive DNA has frans-acting 
effects on gene expression in the genome. 

However, clear evidence for this hypothesis is difficult to 
obtain. It is still unclear what varies across Y chromosomes and 
how exactly that variation mechanistically affects non-Y-linked 
gene expression. Although characterizing variation, and 
especially structural variation, on the Y chromosome remains 



quite challenging, we can gain insight into the basis for YRV 
by examining the properties of the set of genes that appear to 
be regulated by Y-linked variation. We have observed YRV in a 
range of conditions, but both whether a common set of genes 
regulated by Y chromosome variation across genetic back- 
grounds exists and the extent to which genes regulated by 

Y chromosome variation share common sets of genomic cor- 
relates (which might predict something about the mechanistic 
basis for this phenomenon) remain unclear. 

To begin to address these questions, we have taken a 
meta-analytic approach to combine data from a series of pub- 
lished microarray studies (table 1). We estimated robust effect 
sizes and combined probabilities of Y regulation across stud- 
ies, which reveal patterns not apparent from the analysis of 
individual data sets. From this analysis, we first address the 
extent to which different studies reveal a common set of 
underlying YRV genes and then address whether there are 
underlying genomic properties that predict membership in 
the YRV gene class. We show that, indeed, there is a 
common class of genes that vary in expression consistently 
across multiple Y introgression experiments. These genes are 
more likely than non-YRV genes to be tissue biased in expres- 
sion, localized to repressive chromatin, and vary in expression 
within and among species. Taken together, these results pro- 
vide evidence for the hypothesis that differences among 

Y chromosomes modify the distribution of genes in active 
and repressive chromatin across the rest of the genome. 

Materials and Methods 

Defining a YRV Gene Set 

To identify a common set of YRV genes across studies, we first 
selected experiments to study from the six published surveys 
of YRV (Lemos et al. 2008, 2010; Jiang et al. 2010; Paredes 
et al. 2011; Sackton et al. 201 1 ; Zhou et al. 2012). From these 
surveys, we selected the four sets of experiments where at 
least three Y chromosomes were compared on a common 
genetic background in D. melanogaster (table 1). 

For each experiment, we started with raw microarray 
data available at the NCBI Gene Expression Omnibus (GEO) 
database and then processed each set identically using limma 
in Bioconductor (Smyth 2005). We first background-corrected 
arrays using the "normexp" method (Ritchie et al. 2007) and 
then normalized with the "loess" method in limma (Smyth 
2004). After normalization, we filtered data first by removing 



Table 1 

Studies Included in Meta-Analysis, with GEO Information and Other Characteristics 



Study Name GEO Description Reference 

BL08 GSE9457 Original study reporting YRV: compared five geographically disparate Y chromosomes Lemos et al. (2008) 

BL10 GSE23612 YRV in XXY females Lemos et al. (2010) 

SP11 GSE27695 YRV in rDNA deletion lines Paredes et al. (2011) 

JZ12 GSE37068 YRV in mutation accumulation lines (Harwich) Zhou et al. (2012) 
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probes that did not have high-quality data in at least 25% of 
arrays for a given study and second by fitting array weights 
using the ArrayWeights function in limma (Ritchie et al. 2006) 
to downweigh lower quality arrays. In most cases, all replica- 
tion is biological, so we generate fits using the Imfit function in 
limma; for the JZ12 (GSE37068) data set, technical replicates 
were fit using the duplicateCorrelation function in limma. 

For each normalized expression set, we fit a linear model in 
limma with a design matrix calculated using the model Matrix 
function in limma and including a dye term and then extracted 
the fit coefficients for all possible pairwise contrasts among Y 
chromosomes. For each contrast, we calculated Cohen's 
effect size, d, as 

(2 x r)/sqrt(df), (1) 

using the degrees of freedom and moderated T statistic 
calculated by the eBayes function in limma. Within a study, 
all pairwise d statistics were averaged to generate an average 
pairwise d for each study. This value is equivalent to the 
expected difference, in units of standard deviations, between 
two Y chromosomes drawn at random from the pool of 
Y chromosomes included in a particular study. We calculated 
this average d statistic for the four D. melanogaster studies in 
table 1 and then averaged the average d statistics among the 
four studies to estimate an overall effect size for each gene. 
Complete R code for the normalization and analysis steps in 
limma is available from the authors upon request. 

In addition to calculating Cohen's d, we also calculated a 
combined P value using Stouffer's method, in which P values 
are first transformed into Z values before combining (Stouffer 
et al. 1 949). For each gene and each study, we tested the null 
hypothesis of equal expression across all Y introgression lines 
using the F statistic calculated by the limma functions ImFit 
and eBayes. We then combined P values for each gene across 
studies using the R function: 

pnorm(sum(qnorm(x))/sqrt(length(x))), (2) 

where x represents the vector of P values, pnorm is the normal 
distribution function, and qnorm is the normal quantile 
function. After combining P values, we applied a standard 
false discovery rate (FDR) multiple test correction in R using 
the p. adjust function (Benjamini and Hochberg 1995). 

We used these combined P values to generate our YRV 
gene set. We are interested in two kinds of YRV genes: 
those that are common across two or more studies and 
those that are specific to a single study. Because a combined 
P value can give a significant result either because a test is 
highly significant in one study but nonsignificant in the 
remaining studies or because a test is moderately significant 
in multiple studies, the combined P value alone cannot distin- 
guish these. To separate these two classes, we calculated 
leave-one-out combined P values, where we simply drop 
one of the studies before calculation. We define "specific 
YRV genes" as those where: 1) the combined P value that 
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Combined significance test 




Combined significance test 
when individually significant 
study is removed? 




Fig. 1. — Procedure for defining common and specific classes of YRV 
genes, based on the combined q value across studies plus the q values 
of individual studies. No genes with q < 0.05 in more than one study fail to 
achieve significance in the combined statistic. 



includes a given study is significant (q< 0.05), but the com- 
bined P values that exclude that study are not significant or 2) 
the individual multiple-test-corrected P value for a given study 
is significant but the combined P value is not significant. We 
define "common YRV genes" as all the remaining genes with 
a significant combined P value (fig. 1). 

Data Sources for Genomic Correlates 

Much of our analysis focuses on the analysis of a wide range 
of potential correlates to YRV, including components of gene 
structure, chromatin environment, gene expression and func- 
tion, and gene evolutionary patterns. The variables included in 
our analysis, and their sources, are listed in table 2. The full 
data set, including all covariates and all the calculated 
meta-analysis statistics from each study included, is provided 
as supplementary file S1, Supplementary Material online. 

Logistic Regression 

The fundamental logic of our approach is to use a logistic 
regression to ask which properties of genes are best at pre- 
dicting membership in the YRV class, as defined earlier. We 
used model selection techniques to find the best model 
among the large number of possible models that include 
one or more of the terms in table 2, as implemented in the 
R package glmulti (Calcagno 2012). The basic approach is to 
fit a series of main effect models using a logistic regression 
(glm, family = " binomial" in R) and find the best set of models 
based on an information criteria, here the Akaike Information 
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Criterion (AIC). Because the number of possible models is ex- 
tremely large for the number of parameters we examine, we 
used a genetic algorithm to search the space of possible 
models, implemented in glmulti using the default parameters. 
The size of the search space also limits our ability to test 
interactions. To increase our confidence in the output of the 
genetic algorithm, we ran the entire procedure twice and 
combined the results into a single consensus output using 
the consensus function in glmulti. Because the genetic algo- 
rithm is not an exhaustive search procedure, this has the effect 
of increasing the search space and thus increasing our confi- 
dence in the results, but the two runs produce similar results 
when considered individually. From this output, we selected 
the best model based on the importance of each term, 
defined as the proportion of the 200 best models in which 
each given term appears. Full R code is available from the 
authors upon request. 

Cross- Validation 

We used a leave-one-out cross-validation approach imple- 
mented in the R function cv.glm from R package boot 
(Canty and Ripley 2012) to validate our model. We define a 
cost function as: 

mean(abs(observed-predicted) > 0.5), (3) 

which is equivalent to an estimate of inaccuracy or the pro- 
portion of times the model misclassifies the data point left 
out from the leave-one-out cross-validation. Because the 
cross-validation approaches cannot handle data with missing 
values, we ran this analysis on only a subset of the full data set 
that excludes missing values; the results of running our full 
model on this data set are qualitatively identical to the results 
from running the full model on the original data set. 

Clustering Analysis 

We used two approaches for clustering analysis. To analyze 
clustering in the genome, we first divided each chromosome 
into windows of either 100kb or 500 kb. For each window, 
we counted the number of YRV genes in the window and 
then computed an empirical null distribution by permuting the 
assignment of YRV genes 10,000 times and, for each permu- 
tation, counting the number of shuffled YRV genes in each 
window. This permutation approach controls for variation in 
gene density across the genome. To test for clustering speci- 
fically, we asked whether the number of windows with a 
nominally significant (at a = 0.05) excess of YRV genes com- 
pared with the permutation null distribution is significantly 
greater than expected by chance. To test for clustering in 
nuclear space, we used the Hi-C data set from Sexton et al. 
(2012). This data set is based on an experiment in which DNA 
in the nucleus was cross-linked and then fragmented and 
sequenced, such that regions of the genome in close physical 
proximity are likely to be sequenced in the same fragment. 



These contacts can then be empirically scored between sliding 
windows across the genome. Because this contact count is 
heavily dependent on both chromosomal location and the 
chromatin context of the interacting pair, Sexton et al. 
(2012) developed a hierarchical model that corrects for 
these effects. To isolate the effects of YRV genes above and 
beyond these factors, we analyze contact counts normalized 
to the model expectation, rather than raw contact counts. 

Results 

Defining a Common Set of Genes Regulated by Y-Linked 
Variation 

Over the past 5 years, our laboratory has studied the role of 
variation on the Y chromosome on gene expression across a 
variety of contexts and experimental designs. To better under- 
stand the commonalities across study designs in the set of 
genes regulated by Y-linked variation (YRV genes), we used 
a meta-analysis approach. We focus on two related statistical 
approaches: effect size as measured by Cohen's d, and a 
combined P value based on Z scores (Stouffer's method). 
Effect size allows a comparison of the magnitude of an 
effect, in standardized units, across many studies; in this 
case, we calculate an effect size (d) that is equivalent to the 
expression difference in units of standard deviations of a pair 
of Y chromosomes drawn at random, averaged across all 
studies. A combined P value provides a statistically rigorous 
approach to use evidence from all studies to test an underlying 
common null hypothesis; here, a significant combined P value 
indicates statistical support for rejecting the null hypothesis 
that expression of the gene in question does not vary across 

Y introgression lines. We focus our analysis on the results from 
a meta-analysis of the four studies in table 1 and calculate 
both an effect size and a combined P value for all genes. 

On the basis of our combined P value approach (fig. 1), we 
identify a total of 678 genes that are susceptible to YRV, 
which we term YRV genes. Of these, 458 are "common," 
meaning that evidence for a role of Y-linked variation in 
their regulation comes from more than one study, and 220 
are "specific," meaning that one and only one study supports 
a role for Y-linked variation in their regulation. Although the 
"common" set includes a handful of genes with highly signif- 
icant evidence for YRV in all studies, in most cases these genes 
are not individually significant after multiple test correction in 
any studies; rather, the consistency of a trend across all studies 
provides power for the meta-analysis to identify a role for the 

Y chromosome (fig. 2). Nonetheless, we believe that these 
genes are robustly identified by our meta-analysis. Both the 
"specific" and "common" classes have significantly elevated 
effect sizes relative to the non-YRV class (fig. 3A; median 
d is 0.646 for "common," 0.612 for "specific," and 0.391 
for "none," Mann-Whitney U, P value < 2 x 1 0~ 16 for both 
comparisons). In the case of the "specific" genes, this is 
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typically driven by a very high effect size in the study where 
the gene is individually significant, as demonstrated by the 
high coefficient of variation of Cohen's d across studies 
for the "specific" class (median CV for "specific" = 0.253, 
compared with 0.0936 for "common" and 0.0682 for 
"none"; fig. 3fi). 

Although it is possible that some proportion of the genes 
in the "specific" class could represent genetic background 
effects, three of the four studies in our analysis use the 
same background stock. If genetic background effects were 
a major driver of the "specific" class, we would expect the 
single study that uses a different genetic background (Lemos 

300 

250 

200 

I 150H 
o 

100 
50 
0- 

0 12 3 4 

Number of individually significant studies 

Fig. 2. — The proportion of genes in the common class that are indi- 
vidually significant in 0, 1, 2, 3, or 4 of the underlying studies. Light bars are 
for a nominal (not multiple-test corrected) P value < 0.05. Dark gray bars 
are for a FDR-corrected q value < 0.05. 




et al. 2010) to include a disproportionate number of the 
"specific" genes, which does not appear to be the case: 
22.7% of the genes in the "specific" class are significant in 
the single study with a different genetic background, which is 
not significantly different from the 25% expected by chance 
(x 2 = 0.606 1 , df = 1 , P value = 0.4363). However, we do find 
a significant excess of "specific" genes in the study by Paredes 
et al. (2011) (40.9% vs. 25% expected; % 2 = 7.27, df=1, 
P value = 0.007), which examined Y chromosomes carrying 
severe rDNA mutations that might be expected to have dis- 
proportionate effects on gene expression. Thus, we suspect 
that the "specific" effects at least in part represent the fact 
that the Y chromosomes targeted in each study have quite 
different properties and thus may contain specific variation 
that is not observed across all studies; it may especially be 
the case that the severe mutations screened in the study by 
Paredes et al. (201 1) result in particularly severe distortions of 
gene expression. 

Taken as a whole, these results strongly suggest that YRV is 
a phenomenon with significant reach. Even assuming that 
only the "common" class represents robust YRV genes and 
that this study has uncovered all extant YRV (i.e., no false 
negatives), we have shown that variation on the Y chromo- 
some affects more than 3% of the protein-coding genes in 
the D. melanogaster genome. However, our analysis only con- 
siders the 4,271 genes where we had high-quality expression 
information across all studies. Thus, we find evidence for a role 
of YRV in gene expression variation for 15.9% of the genes 
tested, which, if extrapolated to the entire genome, would 
suggest that expression of as many as 2,000 genes could be 
affected by variation on the Y chromosome, either directly or 
indirectly. 
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common specific none 
Meta-analysis significance class 



common specific none 
Meta-analysis significance class 



Fig. 3. — (A) Boxplot showing the average Cohen's d statistic across studies for each significance class, where Cohen's d represents the standardized 
effect size of YRV. Common vs. none, P value < 2 x 10~ 16 (Mann-Whitney U); specific vs. none, P value < 2 x 10~ 16 (Mann-Whitney U). (6) Boxplot of the 
coefficient of variation (CV) of Cohen's d across studies for each significance class. Specific vs. common, Pvalue < 2 x 10~ 16 (Mann-Whitney U); specific vs. 
none, P value < 2 x 10~ 16 (Mann-Whitney U); and common vs. none, Pvalue = 8.31 x 10~ 14 (Mann-Whitney U). 
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Fig. 4. — Ranked AIC support for the 200 best models. Red line is 
placed at the AIC of the 10th best model. 

Genomic Correlates of YRV Susceptibility 

To better understand the basis for susceptibility to YRV, we 
built a logistic regression model to identify the properties of 
genes that are predictive of being in the YRV class. We began 
with a list of 19 variables (table 2). These include parameters 
representing gene expression, evolutionary history, gene 
structure, local genomic environment, and chromatin state. 
To select the best model, we used a genetic algorithm 
search approach implemented in the R package glmulti 
(Calcagno 2012), which uses a genetic algorithm to sample 
a very large number of first-order models (where the terms 
included in the model are a subset of the full model) and 
find the one that minimizes the AIC, a measure of the relative 
goodness of fit of the model. The AIC of the best 200 models 
is shown in figure 4. Because the best set of models are 
relatively close in AIC, we select parameters for the final 
model based on the proportion of times each parameter is 
present in the best 200 models, rather than the absolute best 
model (fig. 5). The final model, then, includes all parameters 
with a model-averaged importance of at least 80%, meaning 
that those terms appear in 80% or more of the 200 best 
models, and is: 

YRV ~1 + het + FbtrLen + Numlnt + tau.avg + mf.pd 
+ postmei. ratio + h2m + expdivtot.m 

(Model 1) 

Every one of these model terms is also included in the single 
best model by AIC, which is: 

YRV ~1 + het + FbtrLen + Numlnt + FirstlntLen + tau.avg 
+ mf.pd + postmei. ratio + h2m + expdivtot.m 

(Model 2) 

As a further test of the validity of this model, we performed 
leave-one-out cross-validation using the cv.glm function in the 
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Fig. 5. — Model-averaged importance of each term in the model 
(table 2), which is defined as the proportion of the 200 best models in 
which a given term appears. Red line indicates 80% support. Terms with 
an importance above the red line are included in our final model. 



R package boot (Canty and Ripley 2012). Leave-one-out 
cross-validation works by leaving out each data point in turn 
and predicting YRV membership of the dropped data point 
using the remaining data. From this procedure, we can calcu- 
late a prediction error term defined as the proportion of data 
points for which we fail to correctly predict the observed data. 
For the model generated from our model selection procedure 
(Model 1 ), the prediction error is 0. 1 264; for the best model by 
AIC (Model 2), the prediction error is 0.1335. 

The best predictors of membership in the YRV class 
are measures of the evolutionary rate of change in gene 
expression (expression divergence and mutational variance 
of expression), gene expression distribution across tissues 
and sexes (tau, M/F expression ratio, and ratio of postmeiotic 
to other expression in spermatogenesis), gene size (transcript 
length, number of introns, and average intron length), and 
chromatin state (fig. 5). Notably, rate of protein evolution 
(as measured by d N /d s ) and overall level of gene expression 
are not important predictors of membership in the YRV class 
(fig. 5). To estimate the magnitude and direction of the effect 
of these predictors on the probability of membership in the 
YRV class (p parameters), we reran the logistic regression 
including only those terms deemed important from the 
model selection procedure (table 2, last column). These results 
suggest that the typical YRV gene is one that is short, with few 
introns, in repressive chromatin, tissue specific, rapidly chang- 
ing in expression both at the population and evolutionary 
level, and expressed postmeiotically during spermatogenesis 
(table 3). 
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Table 3 

Model Parameters from the Final Model 



Term 


P Coefficient 


expdivtot.m 


0.50137 


h2m 


0.90826 


postmei. ratio 


2.28173 


mf.pcl 


-0.62495 


tau.avg 


2.92341 


FbtrLen 


-0.96739 


het 


0.58295 


Numlnt 


-0.13822 



YRV Genes Have Many Characteristics of Repressive 
Chromatin 

One possible mechanistic model for how variation on the 
Y chromosome regulates non-Y-linked gene expression is 
via effects on chromatin state induced by changes in the dis- 
tribution of DNA-binding proteins across the genome. To the 
extent that Y-linked sequences bind proteins associated 
with establishing chromatin states, variation in propensity of 
binding on the Y could impact the distribution of chromatin 
states across the genome. Thus, we were particularly inter- 
ested in the possibility that YRV genes are nonrandomly 
distributed with respect to chromatin state across the 
genome. To further investigate this possibility, we analyzed 
chromatin classes based on protein-binding profiles generated 
by Filion et al. (2010), who defined five chromatin states: 
GREEN (pericentric heterochromatin), BLUE (Polycomb hetero- 
chromatin), BLACK (intercalary heterochromatin), RED 
(active chromatin), and YELLOW (active chromatin). Using 
these classes, we compared effect sizes across chromatin 
states (fig. 6). Genes in the BLUE and BLACK classes, corre- 
sponding to Polycomb and intercalary heterochromatin, 
respectively, have significantly higher average effect sizes 
than genes in the GREEN (pericentric heterochromatin) class, 
or either of the active classes (YELLOW and RED) (fig. 6). The 
BLUE and BLACK classes, in particular, share a high affinity for 
binding of the Suppressor of Under-Replication (SuUR), Lam, 
and D1 proteins, suggesting the possibility that these proteins 
may play a role in YRV. 

If this hypothesis is correct, we would expect that YRV 
genes should be associated with regions of the genome that 
bind these proteins in independent studies. In Drosophila, 
B-type lamin (Lam) is a primary constituent of the nuclear 
lamina, the protein network that lines the inner surface of 
the nuclear envelope; lamins directly interact with chromatin 
and play a role in transcriptional regulation (reviewed 
in Marshall 2002). SuUR is associated with late-replicating 
regions of the genome and may play a role in transcriptional 
regulation as well (reviewed in Schwaiger and Schubeler 
2006). We thus took advantage of two additional data sets 
that independently addressed binding to the B-type lamin 
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Fig. 6. — Boxplot of Cohen's d averaged across studies for each of the 
five chromatin states defined by Filion et al. (201 0). Black, blue, and green 
are repressive chromatin; red and yellow are active chromatin. Boxplots 
with the same letter above them are not significantly different (a = 0.05; 
italics indicate difference at 0.05 < cc < 0.10), based on Tukey HSD 
P values, which indicate that: BLACK is greater than GREEN (P= 0.005), 
RED (P=0.000016), and YELLOW (P< 0.000001); BLUE is greater 
than GREEN (marginally so, P=0.07), RED (P= 0.024), and YELLOW 
(P< 0.000001); GREEN is not significantly different from RED or 
YELLOW; RED is greater than YELLOW (P= 0.001 3). 



(Pickersgill et al. 2006) and replication timing across the 
genome (Schwaiger et al. 2009). Because our logistic regres- 
sion suggests that YRV genes are preferentially localized to 
repressive chromatin, we would expect that YRV genes 
should be both over-represented in the set of genes bound 
to the nuclear lamina and also over-represented among 
late-replicating regions of the genome. 

We find support for both of these hypotheses. Genes in the 
YRV class have a significantly higher ratio of lamin bound to 
lamin unbound signal (YRV= 0.2425, non-YRV= -0.0821, 
both expressed as log 2 [lamin bound/lamin unbound]; 
Mann-Whitney U, Pvalue<2x 10~ 16 ). Although much of 
this effect is driven by the bias toward repressive chromatin 
states for YRV genes, it is notable that YRV genes in active chro- 
matin (RED or YELLOW states) have a significantly higher 
lamin bound/lamin unbound signal than non-YRV genes in 
active chromatin, suggesting that even in active chromatin 
states YRV genes have some repressive-chromatin-like prop- 
erties (in active chromatin: YRV = -0.07855, non-YRV = 
-0.226; Mann-Whitney U, P value = 0.0002; fig. 7). YRV 
genes are also much more likely to be late-replicating than 
non-YRV genes (Fisher's exact test P=1.87x 10~ 7 , odds 
ratio = 1 .612). Although much of this simply reflects the 
strong overlap between regions of the genome that are late 
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replicating and regions of the genome that are in repressive 
chromatin, it is again notable that, in active regions, YRV 
genes are later replicating than non-YRV genes (table 4), 
suggesting that YRV genes in active chromatin have a bias 
toward showing properties associated with repressive 
chromatin. However, these patterns are relatively weak 
compared with the overall bias toward excess YRV genes in 
late-replicating regions of the genome. 

YRV Genes Are Closer than Expected in the Nucleus but 
Are Not Clustered along Chromosomes 

Given the hypothesis that YRV genes are regulated via effects 
on chromatin state, it is plausible that they might be physically 
clustered along the chromosome, as chromatin domains 
are often larger than single genes (e.g., Filion et al. 2010). 
Although previous studies of individual data sets have occa- 
sionally showed relatively modest evidence for clustering 
along chromosome (e.g., Jiang et al. 2010; Zhou et al. 
2012), other individual studies have failed to find this pattern 
(Paredes et al. 2011), and our reanalysis does not show 
compelling evidence for clustering along chromosomes. 
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Fig. 7. — Lamin-binding ratio on a log 2 scale for YRV and non-YRV 
genes sorted into active and repressive chromatin domains. Box widths are 
proportional to the share of each YRV class that is in active vs. repressive 
chromatin: Approximately two-thirds of YRV genes are in repressive chro- 
matin, whereas the opposite is true of non-YRV genes. Although within 
repressive chromatin, YRV and non-YRV genes are not significantly differ- 
ent (Mann-Whitney U, P= 0.9689), in active chromatin, YRV genes have 
significantly higher lamin-binding ratios (Mann-Whitney U, P— 0.0002). 



We calculated, based on 1 ,000 random permutations of the 
distribution of YRV genes in the genome, the probability that 
windows of either 1 00 kb or 500 kb have a significant excess 
of YRV genes. For 500-kb windows, we do not find an excess 
of nominally significant windows beyond that expected 
by chance, and for 100-kb windows, we actually find a 
significant deficit of windows with excess YRV genes 
(500-kb windows: 8/244 significant tests, % 2 = 1.522, 
df=1, P value = 0.2 17; 100-kb windows: 26/1,208 signifi- 
cant tests, x 2 = 20.6232, df = 1, Pvalue=5.59x 10~ 16 ). 

Recent technological innovations (Hi-C) have made it pos- 
sible to measure the physical proximity in the nucleus of DNA 
segments genome wide (Sexton et al. 2012). Because the 
physical conformation of DNA in the nucleus appears to have 
large impacts on the local accessibility of DNA sequence, we 
used this Hi-C data to ask whether YRV genes are closer to 
each other than non-YRV genes are to each other in nuclear 
space. To do this, we classified each pair of contacts identi- 
fied by Sexton et al. (2012) as being between two YRV 
genes, being between a YRV gene and a non-YRV gene, 
or not involving YRV genes. Because physical, 
three-dimensional proximity estimated from Hi-C approaches 
is predicted by both proximity along a chromosome (adjacent 
loci on a chromosome are likely to be adjacent in the nu- 
cleus), and on whether a locus lies in a more tightly packed 
repressive domain or a less tightly packed active domain 
(Sexton et al. 2012), we normalized the observed contact 
counts for each pair to a hierarchical domain model 
(Sexton et al. 2012), which takes into account both linear 
sequence distance and different trends within active and 
repressive domains (Sexton et al. 2012). This normalized dis- 
tance then represents the relative excess or deficit of contacts 
observed compared with the model expectation and controls 
for both proximity along the chromosome and broad-scale 
chromatin context. Sexton et al. (2012) calculate contacts by 
dividing the genome into bins ranging in size from 20 to 
160kb and then counting observed contacts by mapping 
reads to each bin. We focus on the 80 kb bin size for 
simplicity, but substantially similar results are obtained for 
all bin sizes. 

The fraction of YRV/YRV pairs with an excess of observed 
contacts (i.e., the number of observed contacts is higher 



Table 4 

Counts of YRV and Non-YRV Genes that Are in Early- or Late-Replicating Regions of the Genome, by Chromatin State 





Active Chromatin 


Repressive Chromatin 




Non-YRV 


YRV 


Non-YRV YRV 


Early replicating (geneRT > 0) 
Late replicating (geneRT < 0) 


1,996 
269 


204 
40 


521 207 
623 196 




Fisher's exact test (active) 


P= 0.0507 
Odds ratio = 1.45 


Fisher's exact test (repressive) P— 0.0485 

Odds ratio = 0.792 
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than the model prediction) is significantly higher than the 
fraction of non-YRV/non-YRV pairs with an excess of observed 
contacts (47.3% vs. 42.1%, % 2 = 1961 .175, df=1, 
P value < 2 x 1CT 16 ). Average normalized contacts between 
YRV genes are moderately but significantly higher than 
between non-YRV genes, indicating more observed pairs 
than expected under the model (median normalized YRV/ 
YRV contacts, on a log 2 scale = -0.05, median normalized 
non-YRV/non-YRV contacts on a log 2 scale = -0.1 6, Mann- 
Whitney U, P value < 2 x 10~ 16 ). Furthermore, YRV genes are 
on average closer to other YRV genes than to non-YRV genes 
(median normalized YRV/YRV contacts, on a log 2 
scale = -0.05, median normalized YRV/non-YRV contacts, 
on a log 2 scale = -0.1 21, Mann-Whitney U, P value 
<2 x 10~ 16 ). Together, these results imply that YRV genes 
are in closer proximity in the nucleus than non-YRV genes on 
average, over and above the tighter packing predicted by the 
tendency of YRV genes to fall into repressive chromatin. 

Discussion 

Despite growing evidence that the Y chromosome plays a 
significant role in regulating gene expression across many 
genes, and the implication that this phenomenon may under- 
lie the role of variation on the Y chromosome in phenotypic 
variation for traits such as thermal tolerance of spermatogen- 
esis (David et al. 2005), male fitness (Chippindale and Rice 
2001), male fecundity (Sackton et al. 2011), and geotaxis 
(Stoltenberg and Hirsch 1997), we still have little understand- 
ing of what kinds of genes are susceptible to YRV. Do YRV 
genes share common properties that implicate a shared mech- 
anistic basis, or are they idiosyncratic? Can the properties of 
the genes regulated by variation on the Y shed any light on the 
mechanistic basis for this phenomenon? 

In this study, we use meta-analysis approaches to bring 
together the previous work done on YRV with new genomic 
data sets available as a result of the modENCODE project and 
other large-scale genomic screens. These analyses reveal a 
core set of common properties that distinguish YRV genes: 
These genes are more tissue specific, diverge more rapidly in 
expression in intra- and inter-specific contexts, are more likely 
to be located in repressive chromatin, and tend to be shorter 
with fewer introns than the average gene. YRV genes are 
also clustered in physical space in the nucleus but not clearly 
so (or only weakly so) along the chromosome. 

Although some of these traits are of obvious interest 
(chromatin state, which we address in the next section), 
others are harder to interpret. It is not entirely obvious 
why genes affected by variation on the Y chromosome 
would be more tissue specific, except possibly as byproduct 
of a potential role of chromatin state in regulating expression 
of nonhousekeeping genes (many tissue-specific genes 
are in repressive chromatin; Filion et al. 2010; Kharchenko 
et al. 2011). The observation that YRV genes are more 



likely to be expressed postmeiotically in spermatogenesis 
provides a tantalizing link to our previous observations that 
Y chromosome divergence between D. simulans and 
D. sechellia seems to have strongly affected the regulation 
of a number of postmeiotic spermatogenesis genes 
(Sackton et al. 201 1). A possible link between high mutation 
rates of Y-linked repetitive DNA (Lohe and Roberts 2000, 
1990) and YRV could be invoked to interpret the connection 
between mutational variance, YRV, and gene expression 
divergence. 

Of particular interest is the observation that YRV genes 
are more likely to be located in repressive chromatin than 
non-YRV genes. This observation is supported by other prop- 
erties associated with repressive chromatin: YRV genes in gen- 
eral are later replicating than non-YRV genes, which is a 
common correlate of repressive chromatin. Similar to other 
regions of repressive chromatin, YRV genes are also more 
likely to be bound to the nuclear lamina than non-YRV genes. 

We were particularly intrigued to note that, based on 
the chromatin classification scheme of Filion et al. (2010), 
YRV genes are primarily biased toward the two classes of 
nonpericentric heterochromatin (BLACK and BLUE). These 
two classes of chromatin share high levels of binding of 
three proteins (D1, SuLJR, and LAM) that in turn distinguish 
them from other chromatin states. Although we were not 
able to find completely independent verifications of D1 or 
SuLJR binding, we were able to confirm that YRV genes 
share a significant excess of binding to LAM in an independent 
data set. 

The protein D1 is an AT-hook protein, containing a struc- 
tural motif (the AT hook) that is known to bind to AT-rich 
sequences (Levinger 1985; Aulner et al. 2002). D1 has been 
shown in vitro and in vivo to bind to AT-rich satellite motifs 
in D. melanogaster, including the SATI and SATIN repeats 
that are localized to, among other regions, the Y chromosome 
(Aulner et al. 2002; Monod et al. 2002; Blattes et al. 2006). 
Binding of D1 is not, however, limited to the repetitive se- 
quences, as recent high-throughput studies demonstrate D1 
binding to dispersed euchromatin regions of the genome 
(Filion et al. 2010). Although an exact function for D1 is un- 
known, it is hypothesized to be a general transcriptional reg- 
ulator (Levinger 1985; Smith and Weiler 2010). 

These findings suggest a possible hypothesis for the basis 
of YRV, which we refer to as the heterochromatic sink model: 
If variation on the Y chromosome exists for the extent of D1 
binding, this could change the genomic distribution of the D1 
protein between Y introgression lines. Given the potential role 
of D1 in transcriptional regulation, this alone may be sufficient 
to influence gene expression. The implication is that the bind- 
ing of chromosomal associated proteins to the Y chromosome 
alters their binding elsewhere, which in turn modifies gene 
expression profiles. In support of this model, we find that YRV 
genes have significantly more AT-rich upstream regions 
than non-YRV genes (YRV: 56.9% AT, non-YRV: 54.4% AT, 
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Fig. 8. — Boxplot of AT content of upstream regions adjacent to YRV 
and non-YRV genes. Mann-Whitney U, P value = 8.44 x 1CT 5 . 



Mann-Whitney U, P value = 8.44 x 1CT 5 ; fig. 8), implying 
that targets of D1 binding may be more prevalent proximate 
to YRV genes. The obvious implication of this is that we should 
be able to detect a difference in D1 binding in vivo across 

Y introgression lines, and future experiments are underway 
to test exactly this. It is important to note that although D1 is 
an obvious candidate given its binding properties, it is likely 
that other DNA-binding proteins may play an important role in 
this model. 

An alternate, and potentially complementary, hypothesis 
is suggested by the observation that YRV genes both tend 
to occupy a particular place in nuclear space (near the nuclear 
envelope) and that YRV genes are significantly more clustered 
in physical nuclear space than linear space along the chromo- 
some. In this model, which we call the spatial arrangement 
model, variation on the Y chromosome impacts the packing 
of chromosomes into the nucleus and thus the physical 
propensity for YRV genes to be in accessible or inaccessible 
regions of the genome. 

In both the heterochromatin sink model and the spatial 
arrangement model, which are not mutually exclusive, the 

Y chromosome plays a role in modifying how the genome 
is distributed across chromatin compartments. This may be 
particularly important in the case of genes that are only 
expressed in limited contexts (postmeiotically in spermatogen- 
esis, in single tissues), as expression of these genes may be 
particularly sensitive to small changes in the propensity to shift 
from silent to active chromatin contexts. Further work is 
needed, however, to explicitly test this hypothesis in an exper- 
imental context. 
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Supplementary file S1 is available at Genome Biology and 
Evolution online (http://www.gbe.oxfordjournals.org/). 
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